Clustering with scikit-learn
In this notebook, we will learn how to perform k-means clustering using scikit-learn in Python.
We will use cluster analysis to generate a big-picture model of the weather at a local station from minute-granularity data. This dataset contains on the order of a million records. How do we create 12 clusters out of them?
NOTE: The dataset we will use is in a large CSV file called minute_weather.csv. Please download it into the weather directory in your Week-7-MachineLearning folder. The download link is: https://drive.google.com/open?id=0B8iiZ7pSaSFZb3ItQ1l4LWRMTjg
Importing the Necessary Libraries
In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
#import utils
import pandas as pd
import numpy as np
from itertools import cycle, islice
import matplotlib.pyplot as plt
# parallel_coordinates now lives in pandas.plotting (it was removed from pandas.tools.plotting)
from pandas.plotting import parallel_coordinates
%matplotlib inline
Creating a Pandas DataFrame from a CSV file
In [2]:
data = pd.read_csv('./weather/minute_weather.csv')
Minute Weather Data Description
As with the daily weather data, this data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected over a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions was captured.
Each row in minute_weather.csv contains weather data captured for a one-minute interval. Each row, or sample, consists of variables such as rowID, air_pressure, air_temp, avg_wind_direction, avg_wind_speed, max_wind_direction, max_wind_speed, rain_accumulation, rain_duration, and relative_humidity.
In [3]:
data.shape
Out[3]:
In [4]:
data.head()
Out[4]:
Data Sampling
There are lots of rows, so let us sample down by taking every 10th row (those whose rowID is divisible by 10).
In [5]:
sampled_df = data[(data['rowID'] % 10) == 0]
sampled_df.shape
Out[5]:
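As a side note, the same subsample can be obtained with position-based slicing. This is a minimal sketch, assuming the rows are ordered by rowID starting at 0 with no gaps, so that every 10th position coincides with rowID % 10 == 0:

# Alternative sampling: take every 10th row by position.
# Assumes rows are ordered by rowID with no gaps; otherwise the result
# can differ from the modulo-based selection above.
sampled_alt = data.iloc[::10]
sampled_alt.shape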
Statistics
In [6]:
sampled_df.describe().transpose()
Out[6]:
In [7]:
sampled_df[sampled_df['rain_accumulation'] == 0].shape
Out[7]:
In [8]:
sampled_df[sampled_df['rain_duration'] == 0].shape
Out[8]:
Drop the rain_accumulation and rain_duration Columns, then Remove Rows with Missing Values
In [9]:
del sampled_df['rain_accumulation']
del sampled_df['rain_duration']
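Before dropping rows, it can help to see how many missing values remain in each column. A small check, not part of the original notebook:

# Count missing values in each remaining column of the sampled data.
sampled_df.isnull().sum()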
In [10]:
rows_before = sampled_df.shape[0]
sampled_df = sampled_df.dropna()
rows_after = sampled_df.shape[0]
How many rows did we drop?
In [11]:
rows_before - rows_after
Out[11]:
In [12]:
sampled_df.columns
Out[12]:
Select Features of Interest for Clustering
In [13]:
features = ['air_pressure', 'air_temp', 'avg_wind_direction', 'avg_wind_speed', 'max_wind_direction',
'max_wind_speed','relative_humidity']
In [14]:
select_df = sampled_df[features]
In [15]:
select_df.columns
Out[15]:
In [16]:
select_df
Out[16]:
Scale the Features using StandardScaler
In [17]:
X = StandardScaler().fit_transform(select_df)
X
Out[17]:
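StandardScaler standardizes each feature to zero mean and unit variance, z = (x - mean) / std, so that features measured in different units (hPa, degrees, m/s, %) contribute comparably to the Euclidean distances used by k-means. As a sketch (the scaler variable is not in the original notebook), keeping a reference to the fitted scaler lets us convert results back to the original units later:

# Keep the fitted scaler so that cluster centers can later be mapped back
# to the original measurement units with inverse_transform.
scaler = StandardScaler()
X = scaler.fit_transform(select_df)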
Use k-Means Clustering
In [18]:
kmeans = KMeans(n_clusters=12)
model = kmeans.fit(X)
print("model\n", model)
What are the centers of the 12 clusters we formed?
In [22]:
centers = model.cluster_centers_
centers
Out[22]:
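These centers are in standardized (z-score) units. Assuming the fitted scaler was kept as scaler (see the sketch in the scaling step), they can be mapped back to the original measurement units:

# Convert the cluster centers from z-scores back to the original units.
centers_original = scaler.inverse_transform(centers)
pd.DataFrame(centers_original, columns=features)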
Plots
Let us first create some utility functions that will help us plot the graphs:
In [23]:
# Function that creates a DataFrame with a column for Cluster Number
def pd_centers(featuresUsed, centers):
    colNames = list(featuresUsed)
    colNames.append('prediction')
    # Zip with a column called 'prediction' (index)
    Z = [np.append(A, index) for index, A in enumerate(centers)]
    # Convert to pandas data frame for plotting
    P = pd.DataFrame(Z, columns=colNames)
    P['prediction'] = P['prediction'].astype(int)
    return P
In [47]:
# Function that creates Parallel Plots
def parallel_plot(data):
    my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k']), None, len(data)))
    #print(my_colors)
    plt.figure(figsize=(15,8)).gca().axes.set_ylim([-3,+3])
    parallel_coordinates(data, 'prediction', color=my_colors, marker='o')
In [48]:
P = pd_centers(features, centers)
P
Out[48]:
In [49]:
parallel_plot(P[P['relative_humidity'] < -0.5])
In [50]:
parallel_plot(P[P['air_temp'] > 0.5])
In [51]:
parallel_plot(P[(P['relative_humidity'] > 0.5) & (P['air_temp'] < 0.5)])
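Finally, to relate the clusters back to the individual minutes of data, the label learned for each sample can be attached to the selected features. This is a sketch that goes beyond the original notebook; model.labels_ holds one cluster index per row of X:

# Attach the cluster label of each sample and summarize each cluster
# in the original (unscaled) feature units.
labeled_df = select_df.copy()
labeled_df['prediction'] = model.labels_
labeled_df.groupby('prediction').mean()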